Abstract


When a piece of malicious information becomes rampant in an information diffusion network, can we identify the source node that originally introduced the piece into the network and infer the time when it initiated this? Being able to do so is critical for curtailing the spread of malicious information and reducing the potential losses incurred. This is a very challenging problem since typically only incomplete traces are observed and we need to unroll the incomplete traces into the past in order to pinpoint the source. In this paper, we tackle this problem by developing a two-stage framework, which first learns a continuous-time diffusion network model based on historical diffusion traces and then identifies the source of an incomplete diffusion trace by maximizing the likelihood of the trace under the learned model. Experiments on both large synthetic and real-world data show that our framework can effectively “go back to the past”, and pinpoint the source node and its initiation time significantly more accurately than previous state-of-the-art methods.


1 INTRODUCTION 


In September 2014, a collection of hundreds of private pictures of various celebrities, mostly women and often containing nudity, was posted online, and later disseminated by users on websites and social networks such as Imgur¹, Reddit² and Tumblr³ [Kedmey, 2014]. After considerable effort in manually tracing the diffusion


1 http://imgur.com/
2 http://www.reddit.com/
3 https://www.tumblr.com/
paths, investigators found that the imageboard 4chan⁴ was the culprit site where the photos were originally posted on August 31, even though the photos had been taken down from the site soon after they were posted. This leak of private pictures has touched off a larger worldwide discussion and debate on the state of privacy and civil liberties on the Internet [Isaac, 2014].

Can we automatically pinpoint the identity of such malicious information sources, as well as the time when they first posted the malicious information, given historical incomplete diffusion traces? Solving this source identification problem is of outstanding interest in many scenarios [Lappas et al., 2010]. For example, finding people that originate rumors may reduce disinformation, identifying patient zeros in disease spreads may help to understand and control epidemics, and inferring where a trojan or computer worm is initially released may increase the reliability of computer networks.

Related Work. The problem of finding the source of a diffusion trace, also called a cascade, has not been studied until very recently [Lappas et al., 2010; Shah and Zaman, 2010; Aditya Prakash et al., 2012; Pinto et al., 2012]. However, most previous work assumes that a complete steady-state snapshot of the cascade is observed; in other words, we know which nodes got infected but not when they did so. Moreover, previous work uses discrete-time sequential propagation models such as the independent cascade model [Kempe et al., 2003] or the discrete version of the SIR model [Bailey, 1975], which are difficult to estimate accurately from real-world data [Gomez-Rodriguez et al., 2011; Du et al., 2012, 2013b; Zhou et al., 2013a,b].

Only very recently, Pinto et al. [2012] consider a fairly general continuous-time model and assume that only a small fraction of sparsely-placed nodes are observed and, if infected, their infection time is observed. Unfortunately, their approach requires the distance between observed nodes to be large because they approximate the infection times by Gaussian distributions using the central limit theorem. Since this is easily violated in real social and information

4 http://www.4chan.org/ 
Figure 1: Spread of a rumor in a social network. Each edge weight is the time it took for a rumor to pass along the edge. Solid magenta edges indicate the actual path through which the rumor spreads. Green dashed edges are alternative ways in which the rumor could have spread. The infection times of Sophia and Liam are observed (red squares); the infection times of the remaining nodes are hidden (yellow squares). How can an algorithm find that Jacob was the person who initiated the rumor?


networks [Backstrom et al., 2012], we find its performance on this type of network underwhelming, as shown in Section 5.


Challenges. Previous approaches failed to successfully address several challenges of the source identification problem, which we illustrate next using a toy example, shown in Figure 1.


— Partially observed infections. It has become difficult, if not impossible, to collect complete diffusion traces and track each individual infection in online social and information networks. This problem is exacerbated by the need to develop methods that can provide outputs in (almost) real-time. For example, Spinn3r⁵ crawls only a subset of the blogs periodically; Twitter’s streaming API provides a small percentage (1%) of the full stream of tweets [Morstatter et al., 2013]; Facebook users typically keep their activity and profiles private [Sadikov et al., 2011]. It is thus necessary to develop methods that are robust to missing data [Chierichetti et al., 2011; Kim and Leskovec, 2011; Sadikov et al., 2011]. Our toy example illustrates this challenge by considering the infection times of Liam and Sophia to be observed and all other infection times to be missing (hidden or unobserved).


— Unknown infection start time. In most real-world scenarios, the exact time when a piece of malicious information starts spreading is unknown, and thus the observed infection times have only relative meaning. In our toy example, we know Liam got infected 5 time units later than Sophia but we do not actually observe how much time has passed between Jacob’s infection, which triggered the spread, and Sophia’s infection.


5 http://spinn3r.com/ 


— Uncertain transmission delay. The spread of information over social and information networks is a stochastic process. Therefore, we need to consider probabilistic transmission models to capture the uncertainty. For example, our toy example illustrates the spread of a particular rumor and therefore considers a set of fixed edge delays (e.g., the rumor took 5 time units to spread from Sophia to Ethan). However, the edge delays are stochastic and possibly different for every particular rumor (e.g., a different rumor can take more or less than 5 time units to spread from Sophia to Ethan). The edge delay densities, or transmission densities, may depend on parameters like the content of the rumor or the users’ influence.


— Unknown infection path. In large real-world networks, we will often encounter a large number of potential paths that may explain the spread of a rumor from a source node to any other node in the network. In fact, the set of potential paths increases exponentially with network size and network density, and even simply counting the number of paths requires non-trivial methods [Gomez-Rodriguez and Schölkopf, 2012a,b; Du et al., 2013a]. For example, in our toy example, Liam can become aware of the rumor through either Olivia or Emma.


Our Approach. To tackle these challenges, we propose a two-stage scalable framework: we first learn a continuous-time diffusion network model based on historical diffusion traces and then identify the source of an incomplete diffusion trace by maximizing its likelihood under the learned model. The key idea of our framework is to view the problem from the perspective of graphical models, and cast the problem as a maximum likelihood estimation problem, for which we find optimal solutions very efficiently using an importance sampling approximation to the objective and an optimization procedure that exploits the structure of the problem. Additionally, for networks with exponentially distributed edge transmission densities, used previously for modeling information propagation [Gomez-Rodriguez et al., 2011], we show that the objective is a piece-wise uni-modal function with respect to the source’s infection time and develop a more efficient search procedure.

For both synthetic and real-world data, we show that the framework can effectively “travel back to the past”, and pinpoint the source node and its infection time significantly more accurately than other methods.


2 OUR FRAMEWORK 

Our framework for solving the source identification problem consists of two main stages: it first learns a continuous-time diffusion network model based on historical diffusion traces, and then identifies the source of an incomplete diffusion trace and its initiation time by maximizing its likelihood under the learned model. We start our exposition by revisiting the continuous-time generative model for cascade data in social networks introduced in [Gomez-Rodriguez et al., 2011; Du et al., 2013a].


2.1 Continuous-Time Model for Cascades 

Given a directed contact network, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes, a diffusion process begins with an infected source node $s$ initially adopting a certain contagion (idea, rumor or malicious piece of information) at time $t_s$. The contagion is transmitted from the source along its out-going edges to its direct neighbors. Each transmission through an edge entails a random spreading time, $\tau$, drawn from a density over time, $f_{ji}(\tau; \alpha_{ji})$, parametrized by a transmission rate $\alpha_{ji}$. Then, the infected neighbors transmit the contagion to their respective neighbors, and the process continues. We assume transmission times are independent and nonnegative; in other words, a node cannot be infected by a node infected later in time: $f_{ji}(\tau; \alpha_{ji}) = 0$ if $\tau < 0$. Moreover, infected nodes remain infected for the entire diffusion process. Thus, if a node $i$ is infected by multiple neighbors, only the neighbor that first infects node $i$ will be the true parent.

The temporal traces left by diffusion processes are often called cascades. A cascade $\mathbf{t}$ is an $N$-dimensional vector $\mathbf{t} := (t_1, \ldots, t_N)$ recording the times when nodes are infected, i.e., $t_i \in [0, T] \cup \{\infty\}$, where $T$ is the observation window cut-off and $\infty$ denotes nodes that did not get infected during the observation window. However, as noted above, in many scenarios, we only observe a subset of the infected nodes, $\mathcal{O}$, while the state of all other nodes, $\mathcal{H}$, is hidden (we assume the source node $s \in \mathcal{H}$). Our aim is then to find the source $s$ of a cascade from the infection times $\{t_j\}_{j \in \mathcal{O}}$ of the subset of infected nodes $\mathcal{O}$. Figure 1 illustrates the observed data.


2.2 Cascade Likelihood 

According to the conditional independence relation proposed in the continuous-time model for cascades, the complete likelihood of a cascade $\mathbf{t}$ (for both observed and hidden nodes) factorizes as

$$p(\mathbf{t} \mid t_s) = \prod_{i \in \mathcal{O} \cup \mathcal{H}} p\left(t_i \mid \{t_j\}_{j \in \pi_i}\right), \qquad (1)$$

where $\pi_i$ is the set of parents of $i$ defined by the directed graph $\mathcal{G}$. For the infected nodes, Gomez-Rodriguez et al. [2011] showed that the likelihood can be further written as

$$p\left(t_i \mid \{t_j\}_{j \in \pi_i}\right) = \prod_{j \in \pi_i} S_{ji}(t_i - t_j; \alpha_{ji}) \sum_{l \in \pi_i} H_{li}(t_i - t_l; \alpha_{li}),$$

where $S_{ji}(\tau; \alpha_{ji}) = 1 - F_{ji}(\tau; \alpha_{ji})$ is the survival function, $F_{ji}(\tau; \alpha_{ji}) = \int_0^{\tau} f_{ji}(t; \alpha_{ji})\, dt$ is the cumulative distribution function, and $H_{ji}(\tau; \alpha_{ji}) = f_{ji}(\tau; \alpha_{ji}) / S_{ji}(\tau; \alpha_{ji})$ is the hazard function, or instantaneous infection rate. We will focus on the Weibull family of distributions $f_{ji}(\tau; \alpha_{ji})$ since they have been shown to fit real-world diffusion data well [Du et al., 2013b]. In this case,

$$f_{ji}(\tau; \alpha_{ji}) = \frac{k}{\alpha_{ji}} \left(\frac{\tau}{\alpha_{ji}}\right)^{k-1} e^{-\left(\tau / \alpha_{ji}\right)^{k}}, \qquad S_{ji}(\tau; \alpha_{ji}) = e^{-\left(\tau / \alpha_{ji}\right)^{k}},$$

where $k$ is a hyperparameter controlling the shape of the density. This family includes many well-known special cases, such as the exponential or Rayleigh distributions, which have also been used to model information propagation over information networks [Gomez-Rodriguez et al., 2010; Du et al., 2013b].
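As an illustration of the transmission model in this section, the following minimal sketch evaluates the Weibull density, survival and hazard functions and the per-node likelihood term (product of survivals times sum of hazards over parents). It assumes the scale parametrization $f(\tau; \alpha) = (k/\alpha)(\tau/\alpha)^{k-1} e^{-(\tau/\alpha)^k}$ with $k \geq 1$; all function names are hypothetical and are not taken from the authors' implementation.

```python
import math


def weibull_density(tau, alpha, k=1.0):
    """Weibull density with scale alpha and shape k; zero for negative delays."""
    if tau < 0:
        return 0.0
    return (k / alpha) * (tau / alpha) ** (k - 1) * math.exp(-((tau / alpha) ** k))


def weibull_survival(tau, alpha, k=1.0):
    """Probability that the transmission has not happened after a delay tau."""
    if tau < 0:
        return 1.0
    return math.exp(-((tau / alpha) ** k))


def weibull_hazard(tau, alpha, k=1.0):
    """Instantaneous infection rate: density divided by survival."""
    return weibull_density(tau, alpha, k) / weibull_survival(tau, alpha, k)


def node_likelihood(t_i, parent_times, alphas, k=1.0):
    """Likelihood of node i being infected at t_i given its parents' times.

    Only parents infected strictly before t_i are temporally feasible.
    Returns 0.0 when no parent is feasible (assumption of this sketch).
    """
    feasible = [(t_j, a) for t_j, a in zip(parent_times, alphas) if t_j < t_i]
    if not feasible:
        return 0.0
    survival = math.prod(weibull_survival(t_i - t_j, a, k) for t_j, a in feasible)
    hazard = sum(weibull_hazard(t_i - t_j, a, k) for t_j, a in feasible)
    return survival * hazard
```

With a single feasible parent the term reduces to the transmission density itself, which is a quick sanity check; $k = 1$ gives the exponential case (constant hazard) and $k = 2$ the Rayleigh case.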

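Unfortunately, to use Eq. 1, all infected nodes in a cascade need to be fully observed. If we only observe a subset $\mathcal{O}$ of the infected nodes, the likelihood of the incomplete cascade is computed as follows,

$$p\left(\{t_i\}_{i \in \mathcal{O}} \mid t_s\right) = \int_{\Omega} p(\mathbf{t} \mid t_s) \prod_{j \in \mathcal{H}} dt_j = \int_{\Omega} \prod_{i \in \mathcal{O} \cup \mathcal{H}} p\left(t_i \mid \{t_j\}_{j \in \pi_i}\right) \prod_{j \in \mathcal{H}} dt_j, \qquad (2)$$

which essentially marginalizes out the time for all hidden nodes $\mathcal{H}$ over a product space $\Omega := [t_s, \infty)^{|\mathcal{H}|}$. For simplicity of notation, we will omit the domain of the integration in the remainder of the paper.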

The computation of the incomplete likelihood is a difficult high-dimensional integration problem for continuous variables. We will address this technical challenge using importance sampling in Section 3.2.


2.3 Learning Diffusion Networks 

Our framework relies on the assumption that it is possible to record a sufficiently large number of historical cascades, $\mathcal{C}$, in order to discover the existence of all nodes in the network and to infer the network structure as well as the model parameters, $\{\alpha_{ji}\}$. We note that it is not necessary to record cascades that cover all nodes and edges, but each cascade has to be fully observed over a sufficiently long time period. Furthermore, all cascades collectively need to cover the entire diffusion network. Under the precise conditions stated in [Daneshmand et al., 2014], one can infer the parameters of the continuous-time model using an $\ell_1$-regularized maximum likelihood estimation procedure.


2.4 Cascade Source Identification Problem 

Given a learned diffusion model, our aim is to find the source node $s$ of an incomplete cascade, such that the log-likelihood of the incomplete cascade is maximized. Thus, we aim to solve

$$s^* = \arg\max_{s \in \mathcal{H}} \; \max_{t_s \in (-\infty, \min_{i \in \mathcal{O}} t_i)} p\left(\{t_i\}_{i \in \mathcal{O}} \mid t_s\right), \qquad (3)$$

where $p(\{t_i\}_{i \in \mathcal{O}} \mid t_s)$ is defined in Eq. 2 and we assume that $t_s < \min_{i \in \mathcal{O}} t_i$. If we observe several independent incomplete cascades $\mathcal{D}$, all triggered by the same source node, we will maximize their joint likelihood

$$s^* = \arg\max_{s \in \mathcal{H}} \prod_{c \in \mathcal{D}} \left( \max_{t_s \in (-\infty, \min_{i \in \mathcal{O}} t_i)} \mathcal{L}_c \right), \qquad (4)$$

where $\mathcal{L}_c := p(\{t_i\}_{i \in \mathcal{O}} \mid t_s)$ is evaluated on cascade $c$. In the following sections, we will design algorithms to efficiently optimize the above objective and present experimental evaluations.
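The nested structure of this objective (an outer maximization over candidate sources and, for each cascade, an inner maximization over the source time) can be sketched as below. This is only a schematic stand-in: the inner maximization is done by brute force over a user-supplied grid of times rather than by the piece-wise procedure developed later in the paper, and `log_lik`, `time_grid` and `identify_source` are hypothetical names introduced for illustration.

```python
import math


def identify_source(candidates, cascades, log_lik, time_grid):
    """Pick the candidate source with the highest joint log-likelihood.

    candidates: iterable of candidate source nodes
    cascades:   list of dicts mapping observed node -> observed infection time
    log_lik:    callable (source, t_s, cascade) -> log-likelihood (assumed given)
    time_grid:  candidate source times; only those earlier than the first
                observed infection of a cascade are considered for it
    """
    best_source, best_score = None, -math.inf
    for s in candidates:
        score = 0.0
        for cascade in cascades:
            first_observed = min(cascade.values())
            feasible = [t for t in time_grid if t < first_observed]
            if not feasible:
                score = -math.inf
                break
            # Inner maximization over t_s; the product over cascades becomes a sum of logs.
            score += max(log_lik(s, t, cascade) for t in feasible)
        if score > best_score:
            best_source, best_score = s, score
    return best_source, best_score
```

Working with log-likelihoods turns the product over cascades into a sum, which avoids numerical underflow when many cascades are combined.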


3 APPROXIMATE OBJECTIVE FUNCTION

There remain two technical challenges to solve to make our framework useful in practice. First, the likelihood of incomplete cascades, given by Eq. 2, is a difficult high-dimensional integration problem over a continuous domain. We overcome this difficulty with an approximation algorithm, based on importance sampling, which will greatly simplify the integration. Second, the inner-loop maximization over the source timing in Eq. 4 is non-convex. We solve this by designing an efficient algorithm, which finds the global maximum by exploiting the piece-wise structure of the problem.

3.1 Importance Sampling 

Since an analytical evaluation of the integral in Eq. 2 is intractable, we turn to a Monte Carlo approximation. To do so, in principle, we need to draw samples from the posterior distribution of the latent variables, $p(\{t_i\}_{i \in \mathcal{H}} \mid t_s, \{t_i\}_{i \in \mathcal{O}})$, given the source time $t_s$ and the times of the observed nodes, $\{t_i\}_{i \in \mathcal{O}}$. However, it is very challenging to sample from this posterior distribution, and we will instead address the problem by designing an efficient importance sampling approach.

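More specifically, we first introduce a set of auxiliary random variables $\{\eta_i\}_{i \in \mathcal{O}}$, where each variable corresponds to one observed infected node, with an arbitrary joint probability distribution $q(\{\eta_i\}_{i \in \mathcal{O}})$. In the next section, we will briefly discuss how $q$ is chosen. Then, given the auxiliary distribution, we have

$$\begin{aligned} p\left(\{t_i\}_{i \in \mathcal{O}} \mid t_s\right) &= \int p\left(\{t_i\}_{i \in \mathcal{O} \cup \mathcal{H}} \mid t_s\right) \prod_{i \in \mathcal{H}} dt_i \\ &= \int p\left(\{t_i\}_{i \in \mathcal{O} \cup \mathcal{H}} \mid t_s\right) q\left(\{\eta_i\}_{i \in \mathcal{O}}\right) \prod_{i \in \mathcal{H}} dt_i \prod_{i \in \mathcal{O}} d\eta_i. \qquad (5) \end{aligned}$$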


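Second, we introduce the proposal distribution for importance sampling on the auxiliary and hidden variables, $q(\{\eta_i\}_{i \in \mathcal{O}}, \{t_i\}_{i \in \mathcal{H}})$. Then, the integral becomes

$$\begin{aligned} p\left(\{t_i\}_{i \in \mathcal{O}} \mid t_s\right) &= \int \frac{p\left(\{t_i\}_{i \in \mathcal{O} \cup \mathcal{H}} \mid t_s\right) q\left(\{\eta_i\}_{i \in \mathcal{O}}\right)}{q\left(\{\eta_i\}_{i \in \mathcal{O}}, \{t_i\}_{i \in \mathcal{H}}\right)} \, q\left(\{\eta_i\}_{i \in \mathcal{O}}, \{t_i\}_{i \in \mathcal{H}}\right) \prod_{i \in \mathcal{H}} dt_i \prod_{i \in \mathcal{O}} d\eta_i \\ &\approx \frac{1}{L} \sum_{l=1}^{L} \frac{p\left(\{t_i\}_{i \in \mathcal{O}}, \{t_i^l\}_{i \in \mathcal{H}} \mid t_s\right) q\left(\{\eta_i^l\}_{i \in \mathcal{O}}\right)}{q\left(\{\eta_i^l\}_{i \in \mathcal{O}}, \{t_i^l\}_{i \in \mathcal{H}}\right)}, \qquad (6) \end{aligned}$$

where we draw $L$ samples from $q(\{\eta_i\}_{i \in \mathcal{O}}, \{t_i\}_{i \in \mathcal{H}})$ to approximate the integral. Now, we have an approximation to $\mathcal{L}_c$. Next, we explain how to choose the proposal and the auxiliary distributions.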
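The importance sampling idea used here (rewrite an integral as an expectation under a proposal distribution and average the ratio of integrand to proposal density over samples) can be illustrated on a one-dimensional toy integral. The sketch below is generic and is not the paper's estimator: the integrand, the Gaussian proposal, and the names `importance_sampling_integral` and `normal_density` are assumptions chosen purely for illustration.

```python
import math
import random


def normal_density(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def importance_sampling_integral(f, sample_q, density_q, num_samples, rng):
    """Estimate the integral of f as the average of f(x) / q(x) with x drawn from q."""
    total = 0.0
    for _ in range(num_samples):
        x = sample_q(rng)
        total += f(x) / density_q(x)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    estimate = importance_sampling_integral(
        f=lambda x: math.exp(-x * x / 2.0),        # true integral is sqrt(2 * pi)
        sample_q=lambda r: r.gauss(0.0, 2.0),      # wider proposal than the integrand
        density_q=lambda x: normal_density(x, 2.0),
        num_samples=20000,
        rng=rng,
    )
    print(estimate, math.sqrt(2.0 * math.pi))
```

The estimator is unbiased whenever the proposal density is positive wherever the integrand is nonzero; a proposal that resembles the integrand yields lower variance, which is the motivation for the problem-specific proposal chosen in the next section.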


3.2 Choice of Proposal Distributions 

We define our proposal distribution using the forward-generative process of the cascades. Our proposal distribution $q(\{\eta_i\}_{i \in \mathcal{O}}, \{t_i\}_{i \in \mathcal{H}})$ will sample cascades from the learned continuous-time diffusion network model with $s$ as the source set. One of the interesting properties of this proposal distribution is that many terms involving the latent variables in Eq. 6 will cancel out and hence the formula will become simpler.

We remind the reader that the independent cascade model has a useful shortest-path property [Du et al., 2013a], which allows us to sample the parents’ infection times, $\{t_i\}_{i \in \pi_j}$, for each node $j$ efficiently for different source infection times $t_s$. More specifically, we first sample a set of transmission times $\{\tau_{uv}\}_{(u,v) \in \mathcal{E}}$, one per edge, independent of each other. Then, the time $t_i$ taken to infect a node $i$ is simply the length of the shortest path in $\mathcal{G}$ from the source $s$ to node $i$, where the edge weights correspond to the associated transmission times. Let $\mathcal{Q}_i(s)$ be the collection of directed paths in $\mathcal{G}$ from the source $s$ to node $i$, where each path $q \in \mathcal{Q}_i(s)$ contains a sequence of directed edges $(j, m)$, and assume the source node is infected at time $t_s$; then we obtain variable $t_i$ via

$$t_i = t_s + g_i\left(\{\tau_{jm}\}_{(j,m) \in \mathcal{E}} \mid s\right), \qquad g_i\left(\{\tau_{jm}\}_{(j,m) \in \mathcal{E}} \mid s\right) := \min_{q \in \mathcal{Q}_i(s)} \sum_{(j,m) \in q} \tau_{jm}, \qquad (7)$$

where $g_i(\cdot)$ is the value of the shortest path.

This relation is key to speeding up the evaluation of the sampled likelihood in Eq. 6 for different $t_s$ values. First, the sampled transmission times $\tau_{uv}$ are independent and thus can be sampled in parallel. Second, we can reuse the sampled transmission times $\tau_{uv}$ for different $t_s$ values and sources $s$, since the transmission times are independent of $t_s$ and $s$. We only need to compute the infection time $t_i$ for each node using $t_s = 0$, and then for a different value of $t_s$, the infection time is just offset by $t_s$. Third, the likelihood of a sampled cascade $(\{\eta_i^l\}_{i \in \mathcal{O}}, \{t_i^l\}_{i \in \mathcal{H}})$ for $l = 1, \ldots, L$ can be simply computed using Eq. 1 as $p(\{\eta_i^l\}_{i \in \mathcal{O}}, \{t_i^l\}_{i \in \mathcal{H}} \mid t_s)$, which is independent of the actual value of $t_s$ and depends only on the identity of the source node $s$.
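The shortest-path sampling described in this section can be sketched as follows: draw one delay per edge, run Dijkstra's algorithm from the candidate source with $t_s = 0$, and shift the resulting times to obtain the cascade for any other $t_s$. The sketch assumes exponential delays whose mean equals the edge parameter (an assumption of this example, not a statement about the authors' parametrization), and the function names are hypothetical.

```python
import heapq
import random


def sample_transmission_times(edge_params, rng):
    """Draw one delay per directed edge; edge_params maps (u, v) -> mean delay (assumed)."""
    return {edge: rng.expovariate(1.0 / mean) for edge, mean in edge_params.items()}


def infection_times(delays, source, t_s=0.0):
    """Infection time of every node reachable from source: t_s plus shortest-path length."""
    adjacency = {}
    for (u, v), delay in delays.items():
        adjacency.setdefault(u, []).append((v, delay))

    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, delay in adjacency.get(u, []):
            candidate = d + delay
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    # Shifting by t_s reuses the same sampled delays for any source time.
    return {node: t_s + d for node, d in dist.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    edges = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "b"): 0.5}
    delays = sample_transmission_times(edges, rng)
    print(infection_times(delays, "s"))
    print(infection_times(delays, "s", t_s=-3.0))
```

Because the delays do not depend on $t_s$, the expensive shortest-path computation is done once per sample and source, and evaluating a different source time is a constant shift.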


3.3 Choice of Auxiliary Distribution 

The auxiliary distribution $q(\{\eta_i\}_{i \in \mathcal{O}})$ is chosen to be equal to $p(\{\eta_i\}_{i \in \mathcal{O}} \mid \{t_i\}_{i \in \mathcal{H}})$. In other words, our auxiliary distribution will simply sample cascades from the learned continuous-time diffusion network model with $\mathcal{H}$ as the source set. Here, it is easy to see that
Algorithm 1 Our source detection algorithm

Require: $\mathcal{C}$, $\mathcal{D}$, $L$
Infer transmission rates $A$ from $\mathcal{C}$ using [Daneshmand et al., 2014, Algorithm 1].
Sample $L$ sets of transmission times.
Compute infection times $t_i^l$, $i \in \mathcal{V}$, $l = 1, \ldots, L$, assuming $t_s = 0$ using Eq. 7.
Compute change points: $t_i - t_j^l$, $j \in \pi_i \setminus \mathcal{O}$, $l = 1, \ldots, L$, and $t_i - t_j^l$, $j \in \mathcal{M}$, $i \in \pi_j \cap \mathcal{O}$, $l = 1, \ldots, L$.
for $i \in \mathcal{M}$ do
    $t_i^* = \arg\max_{t_s} \phi_L(t_s)$ (using a line search method or Lemma 2 in each piece)
end for
$s^* = \arg\max_{i \in \mathcal{M}} \phi_L(t_i^*)$
$t_s^* = t_{s^*}^*$


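$\int p(\{\eta_i\}_{i \in \mathcal{O}} \mid \{t_i\}_{i \in \mathcal{H}}) \prod_{i \in \mathcal{O}} d\eta_i = 1$. With the above choices for the proposal and auxiliary distributions, we can greatly simplify the approximate likelihood in Eq. 6 into

$$\phi_L(t_s) = \frac{1}{L} \sum_{l=1}^{L} \prod_{i \in \mathcal{O}} \frac{p\left(t_i \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{t_j\}_{j \in \pi_i \cap \mathcal{O}}\right)}{p\left(\eta_i^l \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{\eta_j^l\}_{j \in \pi_i \cap \mathcal{O}}\right)} \prod_{i \in \mathcal{M}} p\left(t_i^l \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{t_j\}_{j \in \pi_i \cap \mathcal{O}}\right), \qquad (8)$$

where $\mathcal{M}$ is the set of hidden nodes with observed variables as parents,

$$\mathcal{M} := \{i \in \mathcal{H} \mid \pi_i \cap \mathcal{O} \neq \emptyset\}, \qquad (9)$$

which is typically much smaller than the overall set of hidden nodes. It is noteworthy that, under mild regularity conditions, the Monte Carlo approximation of the integral will converge to the true value with a sufficient number of samples. However, a clever choice of the proposal distribution makes the convergence faster and the computation more efficient.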

4 MAXIMIZE OBJECTIVE FUNCTION 

Our objective function, given by Eq. 4, consists of an inner and an outer maximization. In the inner maximization, we leverage the Monte Carlo sample approximation and solve

$$\max_{t_s \in (-\infty, \min_{i \in \mathcal{O}} t_i)} \phi_L(t_s). \qquad (10)$$

In the outer maximization, we rank all possible source nodes, $s$, in terms of their best starting time $t_s$, which is the solution to the inner maximization, and then select the top source node in the ranking as our optimal source, $s^*$.

The outer maximization is straightforward; however, the inner maximization, which consists of finding the optimal $t_s$ that maximizes $\phi_L(t_s)$, defined in Eq. 8, may seem difficult at first. Although it is a 1-dimensional problem, the objective function is piece-wise continuous and non-convex with respect to $t_s$. This is because by increasing (or decreasing) $t_s$, the parent-child relation between nodes may change. However, there are two key properties of $\phi_L(t_s)$ which allow us to carry out the optimization efficiently. First, $\phi_L(t_s)$ is piece-wise continuous and the number of such pieces increases as $O(L \Delta N)$, i.e., linearly in the number of Monte Carlo samples, the number of observed nodes, and the maximum in-degree, $\Delta$, of the observed nodes. Second, within each piece, the maximum of the function can be found efficiently.

4.1 Finding Each Continuous Piece 

In this section, we aim to efficiently find all the change points $t_{s_i}$ in the approximated likelihood $\phi_L(t_s)$, given by Eq. 8. In other words, we will efficiently find the left and right end points of each of its continuous pieces. Here, we assume there is a directed path in $\mathcal{G}$ from the source $s$ to each of the observed infected nodes $\mathcal{O}$; otherwise, $s$ trivially cannot be a source for those nodes.

The key idea to finding all change points is realizing that each piece in Eq. 8 corresponds to a different feasible parents-child configuration. Here, by feasible parents, we mean parents that get infected earlier than the child and thus are temporally plausible. More specifically, given a source $s$, Eq. 8 is composed of three types of terms: $p(t_i \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{t_j\}_{j \in \pi_i \cap \mathcal{O}})$ and $p(t_i^l \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{t_j\}_{j \in \pi_i \cap \mathcal{O}})$, which depend on the source time value $t_s$, as we will realize shortly, and thus are responsible for the change point values $t_{s_i}$, and $p(\eta_i^l \mid \{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}, \{\eta_j^l\}_{j \in \pi_i \cap \mathcal{O}})$, which does not depend on $t_s$: because both $\{\eta_i^l\}_{i \in \mathcal{O}}$ and $\{t_j^l\}_{j \in \pi_i \setminus \mathcal{O}}$ are sampled, $t_s$ equally shifts all sampled times and the likelihood is time-shift invariant. Based on the structure of the first two types of terms, it is easy to show that at each change point $t_{s_i}$, there is a node $j \in \mathcal{O} \cup \mathcal{M}$, observed or hidden, that changes its set of feasible (temporally plausible) parents, i.e., a parent of one observed or hidden node becomes (stops being) a feasible parent at time $t_{s_i}$. Therefore, it is clear that there are $O(L \Delta N)$ change points, where $\Delta$ is the maximum in-degree of nodes. Next, we describe a procedure to find all change points efficiently.

Efficient Change Point Enumeration. We start by setting $t_s = 0$ and computing the infection time, denoted as $t_j^l$, for each hidden node $j \in \mathcal{M}$ and realization $l$ using the
Figure 2: Evolution of the proposed method with respect to the number of cascades.


found within an $\epsilon$-neighborhood of $t^*$ with only $2 \log\left(\frac{t_{s_{i+1}} - t_{s_i}}{\epsilon}\right) / \log(3/2)$ evaluations of $\phi_L(\cdot)$.

Furthermore, by utilizing golden section search [Kiefer, 1953], one can further reduce the complexity of finding the optimum point to $\log\left(\frac{t_{s_{i+1}} - t_{s_i}}{\epsilon}\right) / \log(1.618)$ evaluations. We summarize the overall algorithm in Algorithm 1.
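The golden section search mentioned above can be sketched as follows for maximizing a uni-modal function on one piece $[t_{s_i}, t_{s_{i+1}}]$. This is a textbook version of the method, not the authors' code; the function name and the evaluation counter are introduced only for illustration.

```python
import math


def golden_section_max(f, left, right, eps=1e-6):
    """Locate the maximizer of a uni-modal f on [left, right] to within eps.

    Returns (approximate maximizer, number of function evaluations).
    Each iteration shrinks the bracket by a factor 1/1.618 with one new evaluation.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1 / 1.618...
    a, b = left, right
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    evaluations = 2
    while b - a > eps:
        if fc > fd:
            # Maximum lies in [a, d]; the old c becomes the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Maximum lies in [c, b]; the old d becomes the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        evaluations += 1
    return (a + b) / 2.0, evaluations


if __name__ == "__main__":
    t_star, n_eval = golden_section_max(lambda t: -(t - 2.0) ** 2, 0.0, 5.0)
    print(t_star, n_eval)
```

Reusing one of the two probe points at every step is what brings the cost down to one evaluation per iteration, in line with the $\log(\cdot)/\log(1.618)$ count stated above.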


shortest path property described in Section 3.2. Then, we find the change points in which an observed node $i \in \mathcal{O}$ loses feasible parents by computing the time differences $t_i - t_j^l$, $j \in \pi_i \setminus \mathcal{O}$, $l = 1, \ldots, L$, and the change points in which a hidden node $j \in \mathcal{M}$ earns feasible parents by computing the time differences $t_i - t_j^l$, $i \in \pi_j \cap \mathcal{O}$, $l = 1, \ldots, L$. If a time difference is negative, we skip it, since the associated parent will never (always) be feasible, independently of the $t_s$ value.
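The enumeration described above can be sketched as follows. For every Monte Carlo sample, the code collects the differences between an observed infection time and the sampled time (computed with $t_s = 0$) of a hidden parent or of a hidden child in $\mathcal{M}$, and skips negative differences as stated in the text. The data layout (`observed_times`, `sampled_times`, `parents`, `hidden_with_observed_parents`) is a hypothetical choice made for this illustration.

```python
def change_points(observed_times, sampled_times, parents, hidden_with_observed_parents):
    """Enumerate candidate change points of the approximate likelihood.

    observed_times: dict observed node -> observed infection time t_i
    sampled_times:  list (one dict per sample l) of node -> t_j^l with t_s = 0
    parents:        dict node -> list of its parent nodes in the network
    hidden_with_observed_parents: the set M of hidden nodes with an observed parent
    """
    points = set()
    for sample in sampled_times:
        # Observed node i whose hidden parent j may stop being feasible.
        for i, t_i in observed_times.items():
            for j in parents.get(i, []):
                if j not in observed_times and j in sample:
                    diff = t_i - sample[j]
                    if diff >= 0:
                        points.add(diff)
        # Hidden node j in M whose observed parent i may become feasible.
        for j in hidden_with_observed_parents:
            if j not in sample:
                continue
            for i in parents.get(j, []):
                if i in observed_times:
                    diff = observed_times[i] - sample[j]
                    if diff >= 0:
                        points.add(diff)
    return sorted(points)
```

Each returned value is a source time at which a shifted sampled time coincides with an observed time, so consecutive values delimit one continuous piece of the objective.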

Additionally, we can compute $\phi_L(\cdot)$ efficiently for each change point $t_{s_i}$, since at each change point $t_{s_i}$, we only need to re-evaluate the terms corresponding to the node $i \in \mathcal{O} \cup \mathcal{M}$ that changes its set of feasible parents. In the case of exponential transmission likelihoods, once we have computed the likelihood at each change point $t_{s_i}$, we can re-evaluate it at any time $t \in [t_{s_i}, t_{s_{i+1}})$ by multiplying the corresponding terms in the approximated likelihood by $e^{t - t_{s_i}}$.

4.2 Maximizing within Each Piece 

Once we have delimited each piece of the approximate likelihood given by Eq. 8, we can find the times $t_s$ that maximize the likelihood in each piece efficiently, using well-known line-search procedures for one-dimensional continuous functions, such as the forward-backward method, the golden section method or the Fibonacci method [Luenberger, 1973]. However, in the case of exponential transmission likelihoods, we can perform the maximization step even more efficiently.

Exponential Transmission. We start by realizing that, in the case of exponential transmission functions, the approximate likelihood given by Eq. 8 can be expressed as

$$\phi_L(t_s) = \sum_{l=1}^{L} \gamma_l e^{\beta_l t_s}, \qquad (11)$$

where $\gamma_l > 0$ and $\beta_l$ are independent of $t_s$. Then, we can prove that each piece of the approximate likelihood is uni-modal (proven in Appendix A):

Lemma 1 $\phi_L(t_s)$ is uni-modal in $t_{s_i} < t_s < t_{s_{i+1}}$.

5 EXPERIMENTS 

We evaluate the performance of our method on: (i) synthetic networks that mimic the structure of social networks and (ii) real networks inferred from a large cascade dataset, using a well-known state-of-the-art network inference method [Gomez-Rodriguez et al., 2011]. We show that our approach discovers the true source of a cascade or set of cascades with surprisingly high accuracy in synthetic networks and quite often in real networks, given the difficulty of the problem, and significantly outperforms several baselines and two state-of-the-art methods [Aditya Prakash et al., 2012; Pinto et al., 2012]. Appendix C provides additional experimental results.

5.1 Experiments on Synthetic Data 

Experimental Setup. We generate three types of Kronecker networks [Leskovec et al., 2010]: (i) core-periphery networks (parameter matrix: [0.9 0.5; 0.5 0.3]), which mimic the information diffusion traces in real-world networks [Gomez-Rodriguez et al., 2010], (ii) random networks ([0.5 0.5; 0.5 0.5]), typically used in physics and graph theory [Easley and Kleinberg, 2010], and (iii) hierarchical networks ([0.9 0.1; 0.1 0.9]) [Clauset et al., 2008]. We then set the pairwise transmission rates of the edges of the networks by drawing samples from $\alpha \sim U(10, 5)$. For each type of Kronecker network, we generate 10 networks with 256 nodes and 512 edges. Finally, for each network, we generate a set of cascades from ten different random sources $s^*$. Since we are interested in detecting source nodes of large cascades, we only consider source nodes that triggered at least ten large cascades out of 100 simulated cascades. Given the size of the networks we experiment with, we consider a cascade to be large if it contains more than 40 nodes. Our aim is then to find the source of a large cascade or small set of large cascades from the infection times of a small (unknown) fraction of all infected nodes. In all the following experiments, the sample size is 400 and 10% of the infected nodes are observed, except when explicitly mentioned otherwise.


Now, we can find the maximum of $\phi_L(\cdot)$ by evaluating the function only on a sequence of points (proven in Appendix B):

Lemma 2 The maximum point of $\phi_L(\cdot)$ can be


A Toy Example. We first consider a small 64-node hierarchical Kronecker network and visualize the approximate likelihood given by Eq. 8 against the number of observed cascades for each node in the network. We use 150 Monte Carlo samples. Figure 2 summarizes the results, where
Figure 3: Success Probability (SP) and Top-10 Success Probability (Top-10 SP) for two types of Kronecker networks. Panels: (a) SP, Random; (b) SP, Core-Periphery; (c) Top-10 SP, Random; (d) Top-10 SP, Core-Periphery.


each square represents a node, the true source is marked with a star and the heat map represents normalized likelihoods in [0, 1]. In this toy example, a single cascade is insufficient to detect the true source, since it has a relatively low likelihood. However, once more cascades are observed, the likelihood of the true source increases and ultimately becomes higher than that of all other nodes for 8 cascades.


Accuracy. Next, we evaluate the accuracy of our method in comparison with two state-of-the-art methods, NetSleuth [Aditya Prakash et al., 2012] and Pinto’s method [Pinto et al., 2012], and two baselines in larger synthetic networks. The first baseline runs Monte Carlo simulations from each potential source and counts the maximum number of observed infected nodes that get infected in a time window equal to the length of the observation window. It then ranks the potential sources according to the average value of this quantity, where the node with the highest value is the top node. The second baseline first finds all potential sources that can reach all observed infected nodes and then ranks them by decreasing out-degree, where the node with the highest out-degree is the top node. NetSleuth assumes the same infection probability $\beta$ over all the edges, which we set to 0.1, following [Aditya Prakash et al., 2012]. Pinto’s method similarly assumes that all pairwise transmission times come from the same Gaussian distribution and requires its mean to be much larger than its standard deviation in order to guarantee nonnegative transmission times. In their work, they set $\mu / \sigma = 4$, where $\mu$ and $\sigma$ are the mean and standard deviation, respectively. Since our fitted diffusion network contains edges with different transmission rates and thus different expected transmission times, we set the parameter $\mu$ to be the minimum expected value over all the edges.
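The second (out-degree) baseline described above can be sketched as follows. The sketch restricts candidates to unobserved nodes, in line with the assumption that the source is hidden; that restriction, the adjacency-dictionary layout, and the tie-breaking by node label are choices made for this illustration rather than details given in the text.

```python
from collections import deque


def reachable_from(adjacency, start):
    """Set of nodes reachable from start by directed paths (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def out_degree_baseline(adjacency, observed):
    """Rank unobserved nodes that reach every observed node by decreasing out-degree."""
    observed = set(observed)
    nodes = set(adjacency) | {v for targets in adjacency.values() for v in targets}
    candidates = [
        s for s in nodes
        if s not in observed and observed <= reachable_from(adjacency, s)
    ]
    # Ties are broken by node label only to make the ranking deterministic.
    return sorted(candidates, key=lambda s: (-len(adjacency.get(s, [])), str(s)))
```

For example, with edges a→b, a→c, b→c, d→a, d→b, d→c and observed node c, the ranking is d, a, b.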


We used two measures of accuracy: success probability and top-10 success probability. We define success probability as $P(\hat{s} = s^*)$ and top-10 success probability as the probability that the true source $s^*$ is among the top-10 in terms of maximum likelihood or ranking. For each network type, we estimated both measures by running our method on 10 different random source sets. Since NetSleuth and Pinto’s method can only accept one observed cascade at a time, we run the methods independently for each individual cascade



Figure 4: Mean-squared error (MSE) on the estimation of $t_s$ for two types of Kronecker networks. Panels: (a) Random; (b) Core-Periphery.


and then compute the top-1 and top-10 success probability based on all outputs for all cascades. Figure 3 summarizes the results for two types of Kronecker networks against the number of observed cascades. Our method dramatically outperforms all others, achieving a success probability as high as 0.6 and a top-10 success probability of almost 1. The low performance that state-of-the-art methods exhibit, in comparison with the validation within the corresponding papers, may be explained as follows: in both cases, the authors validated their algorithms with synthetic and real networks with large diameters, without long-range connections, such as 2-D grids [Aditya Prakash et al., 2012] and spatial (geographical) networks [Pinto et al., 2012], where the source identification problem is much easier.


Source Infection Time Estimation. We also evaluate
how accurately our method infers the infection time of the
true source by computing the mean squared error (MSE),
$\mathbb{E}_{s^*}[(\hat{t}_{s^*} - t_{s^*})^2]$, estimated by running our method on 10
different random sources. Here, we do not compare with
other competing methods since they do not provide an
estimate of the infection time of the true source. Figure 4
shows the MSE of the estimated infection times of the true
source for the same networks as above against the number
of cascades.
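
For concreteness, the three evaluation measures can be computed from per-trial outputs as in the following sketch. It assumes that each trial yields a ranked list of candidate sources and an estimated source infection time; the function names and the toy values are hypothetical and are not taken from our experiments.

```python
def success_probability(rankings, true_sources, k=1):
    """Fraction of trials whose true source is among the top-k ranked nodes."""
    hits = sum(true in ranking[:k]
               for ranking, true in zip(rankings, true_sources))
    return hits / len(true_sources)


def mean_squared_error(estimated_times, true_times):
    """Average squared difference between estimated and true source times."""
    squared = [(est - true) ** 2
               for est, true in zip(estimated_times, true_times)]
    return sum(squared) / len(squared)


# Toy outputs of four trials (hypothetical values).
rankings = [["u", "v", "w"], ["v", "u", "w"], ["w", "v", "u"], ["u", "w", "v"]]
true_sources = ["u", "u", "u", "u"]
estimated_times = [1.0, 2.5, 0.0, 4.0]
true_times = [1.5, 2.0, 1.0, 4.0]

sp = success_probability(rankings, true_sources, k=1)
top2_sp = success_probability(rankings, true_sources, k=2)
mse = mean_squared_error(estimated_times, true_times)
```

Setting `k=10` gives the top-10 success probability; the toy example uses `k=2` only because its rankings contain three nodes.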


5.2 Experiments on Real Data 

Experimental Setup. We focus on the spread of memes,
which are short textual phrases (like “lipstick on a pig”)
that travel almost intact through the Web [Leskovec et al.,
2009]. We experiment with a large meme dataset^6, which
traces the spread of memes across 1,700 popular mainstream
media sites and blogs [Gomez-Rodriguez et al., 2013]. The
dataset classifies memes per topic, and associates each
meme $m$ to an information cascade $\mathbf{t}^m$, which
is simply a record of times when sites first mentioned meme
$m$. We proceed as follows. We first infer an underlying
diffusion network per topic using NetRate, a well-known
network inference method [Gomez-Rodriguez et al., 2011],
using all observed information cascades. We then use these
inferred networks along with a percentage of the infections
of large cascades to infer the source of these cascades. We
select 15 sources, each of them having at least 10 long
cascades. Here, by long cascade we mean a cascade with more
than 27 nodes. The results are averaged over 5 runs,
randomizing the selection of the observed nodes. We consider
that 10% of the infected nodes are observed and use 500
samples to approximate the likelihood.


Figure 5: Success Probability (SP), Top-10 Success Probability (Top-10 SP) and mean-squared error (MSE) on the
estimation of $t_s$ for real cascade data.
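
The selection and subsampling steps above can be sketched as follows. This is only an illustration under stated assumptions: a cascade is represented as a dictionary from node to first-mention time, its source is taken to be the earliest node, the observed nodes are drawn uniformly at random, and all names (`eligible_sources`, `observe_fraction`, `make_toy_cascade`) are hypothetical. The sketch does not restrict the selection to 15 sources.

```python
import random
from collections import defaultdict

MIN_LONG_CASCADE_NODES = 28   # "more than 27 nodes"
MIN_LONG_CASCADES = 10        # long cascades required per source
OBSERVED_FRACTION = 0.10      # 10% of the infected nodes are observed


def cascade_source(cascade):
    """Return the earliest-infected node of a cascade (dict: node -> time)."""
    return min(cascade, key=cascade.get)


def eligible_sources(cascades):
    """Group long cascades by source and keep sources with enough of them."""
    long_by_source = defaultdict(list)
    for cascade in cascades:
        if len(cascade) >= MIN_LONG_CASCADE_NODES:
            long_by_source[cascade_source(cascade)].append(cascade)
    return {s: cs for s, cs in long_by_source.items()
            if len(cs) >= MIN_LONG_CASCADES}


def observe_fraction(cascade, rng, fraction=OBSERVED_FRACTION):
    """Keep a uniformly random subset of the infected nodes and their times."""
    k = max(1, round(fraction * len(cascade)))
    kept = rng.sample(sorted(cascade), k)
    return {node: cascade[node] for node in kept}


def make_toy_cascade(source, size):
    """Hypothetical cascade: the source at time 0, other sites afterwards."""
    cascade = {source: 0.0}
    for i in range(1, size):
        cascade[f"{source}/site{i}"] = float(i)
    return cascade


toy_cascades = (
    [make_toy_cascade("s0", 30) for _ in range(10)]    # qualifies
    + [make_toy_cascade("s1", 30) for _ in range(9)]   # too few long cascades
    + [make_toy_cascade("s2", 20) for _ in range(10)]  # cascades too short
)
selected = eligible_sources(toy_cascades)
partial = observe_fraction(selected["s0"][0], random.Random(0))
```

Re-running `observe_fraction` with different random generators corresponds to randomizing the selection of the observed nodes across runs.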


Accuracy. We evaluate the accuracy of our method in
comparison with Net Sleuth and the same baselines as in the
synthetic experiments, using success probability and top-10
success probability. Unfortunately, we cannot compare
to Pinto’s method because it requires the identity of the
true parent for each observed node in each cascade, and
this is not available in real cascade data. Figure 5
summarizes the results. Surprisingly, neither Net Sleuth nor
the baselines succeed at detecting cascade sources in real
data, even with 10 observed cascades; they output solutions
with (almost) zero success and top-10 success probabilities.
In contrast, our method achieves a non-zero success
probability and top-10 success probability as long as we
observe more than 8 and 6 cascades, respectively, a fairly
low number of cascades in this scenario. Even then, the
performance of our method in terms of success and top-10
success probability may seem low at first. However, we would
like to highlight how difficult the problem we are trying to
solve is by considering the performance of two simple random
guessers. A first random guesser that chooses the source
uniformly at random from all nodes in the network would
succeed with probability $1/1700 \approx 5.9 \times 10^{-4}$, almost 20
times less accurate than our method. A second random guesser
that chooses the source uniformly at random among the nodes
from which the observed nodes are reachable would succeed
with probability $1/425 \approx 2.4 \times 10^{-3}$, almost 5 times
less accurate. The same argument for top-10 success
probability shows a 12-fold improvement in accuracy compared
to the naive guesser and a 3-fold improvement compared
to the more clever one. Finally, our method’s MSE values
indicate that our method is able to find the source infection
time within an accuracy of $\sqrt{2000} \approx 45$ days. We find
this quite remarkable given that the cascades we considered
typically unfold during a 1-year period.

6 Data is available at http://snap.stanford.edu/infopath/
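
The figures quoted for the two random guessers and for the timing accuracy follow from elementary arithmetic, reproduced in the sketch below. The variable names are hypothetical, and the top-10 guesser probabilities assume a guesser that outputs 10 distinct nodes chosen uniformly at random, which is our reading of "the same argument".

```python
import math

NUM_SITES = 1700           # nodes in the meme network
NUM_REACHING_NODES = 425   # nodes from which the observed nodes are reachable
TOP_K = 10

# Probability that a single uniform guess hits the true source.
p_uniform = 1 / NUM_SITES
p_reaching = 1 / NUM_REACHING_NODES

# Probability that a random list of TOP_K distinct nodes contains the source.
p_uniform_top10 = TOP_K / NUM_SITES
p_reaching_top10 = TOP_K / NUM_REACHING_NODES

# Root of an MSE of 2000 (in squared days) gives the timing accuracy in days.
rmse_days = math.sqrt(2000)

print(f"{p_uniform:.2e} {p_reaching:.2e} {rmse_days:.1f}")
```

The restricted guesser is exactly 4 times more likely to succeed than the uniform one, since 1700 / 425 = 4.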


6 CONCLUSIONS 


We propose a two-stage framework for detecting the
source of a cascade in continuous-time diffusion networks,
which improves dramatically over previous state-of-the-art
methods in terms of detection accuracy. Our framework casts
the problem as a maximum likelihood estimation problem
and then finds optimal solutions very efficiently using
an importance sampling approximation to the objective
and an optimization procedure that exploits the structure
of the problem. Our work opens many interesting
avenues for future work. For example, it would be useful
to extend our method to support cascades with multiple
sources and continuous-time models other than
the continuous-time independent cascade model
[Gomez-Rodriguez et al., 2011]. A theoretical analysis of our
importance sampling scheme would also be interesting. Finally,
it would be interesting to apply the current framework to
other real-world datasets.

Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades


A Proof of Lemma 1

Suppose there are two stationary points, i.e., $\phi'_L(x) = \phi'_L(y) = 0$. Then, by continuity of $\phi'_L(\cdot)$ in $(t_{s_i}, t_{s_{i+1}})$, there must be
a $z \in (x, y)$ such that $\phi''_L(z) = 0$. This is a contradiction, as

$$\phi''_L(t_s) = \sum_{l=1}^{L} \gamma_l \beta_l^2 e^{\beta_l t_s} > 0 \qquad (12)$$

for all $1 \leq l \leq L$.


B Proof of Lemma 2


Assume we would like to find the maximizer of $\phi_L$ in the interval $(a, b)$ and consider two points at one third and two thirds of
the interval, i.e., $c = a + \frac{b - a}{3}$ and $d = a + \frac{2(b - a)}{3}$. It can be easily shown that, if $\phi_L(c) < \phi_L(d)$, then the maximizer must lie
in the interval $(c, b)$ and, if $\phi_L(c) \geq \phi_L(d)$, then the maximizer must lie in the interval $(a, d)$. Therefore, with two evaluations, we
can shrink the interval containing the maximizer by a factor of $\frac{2}{3}$. Then, to reach the $\epsilon$-neighborhood of the real maximizer,
we need to evaluate the function $2r$ times, where

$$(t_{s_{i+1}} - t_{s_i}) \left(\frac{2}{3}\right)^{r} < \epsilon. \qquad (13)$$

This proves our claim.
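
The interval-shrinking argument corresponds to a standard ternary search, sketched below. This is a generic illustration, not our implementation: it assumes the objective has a single maximizer in the interval, and the function name `ternary_search_max` and the quadratic test objective are hypothetical.

```python
import math


def ternary_search_max(f, a, b, eps):
    """Locate the maximizer of a unimodal f on (a, b) to within eps.

    Returns the midpoint of the final interval and the number of
    function evaluations, which is two per shrinking step.
    """
    evaluations = 0
    while b - a >= eps:
        c = a + (b - a) / 3.0          # one third of the interval
        d = a + 2.0 * (b - a) / 3.0    # two thirds of the interval
        evaluations += 2
        if f(c) < f(d):
            a = c                      # maximizer lies in (c, b)
        else:
            b = d                      # maximizer lies in (a, d)
    return (a + b) / 2.0, evaluations


def objective(t):
    """Hypothetical concave objective with its maximum at t = 2."""
    return -(t - 2.0) ** 2


a0, b0, eps = 0.0, 5.0, 1e-6
t_hat, num_evals = ternary_search_max(objective, a0, b0, eps)

# Smallest r with (b0 - a0) * (2/3)**r < eps, as in Eq. 13.
r = math.ceil(math.log(eps / (b0 - a0)) / math.log(2.0 / 3.0))
```

In this example the search uses `2 * r` evaluations, matching the count in the proof, because each step shrinks the interval to two thirds of its length.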


C Additional Experimental Results 

In this section, we provide additional experimental results on synthetic data, including an evaluation of the performance of
our method against the percentage of observed infections and the number of Monte Carlo samples, as well as a scalability
analysis.

Performance vs. percentage of observed infections. Intuitively, the greater the number of observed infections, the more
accurately our method can infer the true source and its infection time. Figure 6 confirms this intuition by showing the
success probability against the percentage of observed infections. However, we also find that the greater the percentage of
observed infections, the smaller the effect of observing additional infections, i.e., a diminishing returns property.



Figure 6: Accuracy vs. % observed infections. 


Performance vs. number of Monte Carlo samples. Drawing more transmission time samples over the edges $\mathcal{E}$ leads to a better
estimate of Eq. 6 and thus a greater accuracy of our method. Figure 7 shows the success probability against the number of
samples. Importantly, we observe that as long as the number of samples is large enough, the performance of our method
quickly flattens and no longer depends on the number of samples.
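
The flattening behavior is typical of Monte Carlo estimates in general, which the following toy sketch illustrates. It does not evaluate Eq. 6: it assumes a hypothetical two-edge path with independent exponential transmission times of rate 1 and estimates the probability that the infection crosses both edges within a time horizon of 2, a quantity with a known closed form. All names are illustrative.

```python
import math
import random


def exact_reach_probability(horizon, rate):
    """P(T1 + T2 <= horizon) for two i.i.d. Exp(rate) times (Erlang-2 CDF)."""
    return 1.0 - math.exp(-rate * horizon) * (1.0 + rate * horizon)


def estimate_reach_probability(num_samples, horizon=2.0, rate=1.0, seed=0):
    """Monte Carlo estimate of the same probability from sampled times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        total = rng.expovariate(rate) + rng.expovariate(rate)
        if total <= horizon:
            hits += 1
    return hits / num_samples


exact = exact_reach_probability(horizon=2.0, rate=1.0)
errors = {n: abs(estimate_reach_probability(n) - exact)
          for n in (10, 100, 1000, 10000)}
```

The standard error of such an estimate decays as the inverse square root of the number of samples, so beyond a moderate sample size additional samples change the estimate only marginally.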

Running time vs. percentage of observed infections. Figure 8 plots the average running time to infer the source of a
single cascade against the percentage of observed infections. Perhaps surprisingly, the running time barely increases with
the percentage of observed infections.

Running time vs. number of samples. Figure 9 plots the average running time against the number of Monte Carlo samples
used to approximate the likelihood, Eq. 6.

Figure 7: Accuracy vs. number of samples.



(a) Random (b) Hierarchical (c) Core-periphery 

Figure 8: Running time vs. % observed infections.

Figure 9: Running time vs. number of samples.

Toy example. We consider the same 64-node hierarchical Kronecker network as in Section 5.1 and visualize the approximate
likelihood given by Eq. 8 against the number of observed cascades ($C = 1, \ldots, 8$) for each node in the network using
150 Monte Carlo samples.

Figure 10: Evolution of the proposed method with respect to the number of cascades.


Accuracy on a hierarchical Kronecker network. We additionally evaluate the accuracy of our method in comparison
with the same two state-of-the-art methods and two baselines as in Section 5.1 on a hierarchical Kronecker network.
Figure 11 shows the success probability (SP), top-10 success probability (Top-10 SP), and mean squared error (MSE) on the
estimation of $t_s$.

Figure 11: Success Probability (SP), Top-10 Success Probability (Top-10 SP) and mean-squared error (MSE) on the
estimation of $t_s$ for a hierarchical Kronecker network.























